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Abstract 

This paper describes a graph clustering algorithm that aims to minimize the normalized cut 
criterion and has a model order selection procedure. The performance of the proposed algo- 
rithm is comparable to spectral approaches in terms of minimizing normalized cut. However, 
unlike spectral approaches, the proposed algorithm scales to graphs with millions of nodes 
and edges. The algorithm consists of three components that are processed sequentially: a 
greedy agglomerative hierarchical clustering procedure, model order selection, and a local 
refinement. 

For a graph of n nodes and 0{n) edges, the computational complexity of the algorithm 
is 0(r2 log^n), a major improvement over the 0{'n?) complexity of spectral methods. Ex- 
periments are performed on real and synthetic networks to demonstrate the scalability of 
the proposed approach, the effectiveness of the model order selection procedure, and the 
performance of the proposed algorithm in terms of minimizing the normalized cut metric. 

Keywords: Graph Clustering, Normalized Cut, Model Order Selection, Large Scale 
1. Introduction 

Clustering or partitioning of nodes in networks or graphs is an important task that has 
many applications in a diverse range of fields. It has been used for many years to study 
social networks [l| and continues to be employed in the field of sociology to explore social 
interactions. More recently it has been employed in the study of biochemical networks [2I, [sl, 
biological neural networks jl, [El, and transport and communication networks. 

There are three components to the clustering task: (i) choosing the number of clusters; 

(ii) selecting a criterion that measures the merit of each candidate cluster allocation; and 

(iii) identifying an algorithm that searches for the optimal clustering. Some performance 
criteria can be used both to select the number of clusters and choose the clustering, e.g., 
modularity or information-theoretic criteria based on the minimum description length [3]. 

There is no universally-accepted performance criterion and indeed the most appropri- 
ate criterion can vary depending on the application domain and the goal of the clustering. 
Modularity, for example, focuses on network structure and places primary value on direct 
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connections between nodes. On the other hand, clustering algorithms based on Markov ran- 
dom walks 0, Isf also value indirect connections and network flow. In this paper we select the 
normalized cut criterion jof, which simultaneously encourages intra-cluster similarity while 
penalizing inter-cluster similarities. Methods based on this criterion have been employed 
successfully in a wide range of applications (H, [sl' 12|. The normalized cut criterion is related 
to the conductance of the underlying graph [l3|. Furthermore an implicit duality between 
normalized cut and normalized association exists; the former encourages clusters that are less 
connected to each other and the latter encourages clusters whose nodes are well-connected. 

When partitioning a graph into clusters, the cut associated with cluster i is the sum 
of the weights of the edges between nodes in i and nodes in other clusters. Intuitively, 
minimizing the maximum cut (or minimizing the average cut) identifles clusters that have 
the weakest ties between them. The problem is that the minimum cut will often be achieved 
by identifying many very small clusters, which provides little insight into the underlying 
structure of the graph. The normalized cut metric was introduced by Shi and Malik in |9| 
to address this shortcoming. It normalizes each cut by dividing it by the total weight of 
the associated cluster. This has the effect of penalizing very small clusters because they 
generally have low total weight. 

Minimizing normalized cut is an NP-complete problem jof. Shi and Malik illustrated 
that the minimization could be relaxed to form a generalized eigenvalue system, whose (dis- 
cretized) solution corresponds to the minimum normalized cut. This has led to the develop- 
ment of a spectral clustering; it involves calculating eigenvectors of the identifled system. In 
order to identify a partitioning of n nodes to k clusters, some techniques perform recursive 
bipartitioning j9[, and thus require the repeated identiflcation of two eigenvectors. Other 
approaches strive to identify k clusters directly by calculating the k smallest eigenvectors of 
the underlying graph Laplacian. One of the concerns about these methods is computational 
complexity; even when employing fast eigendecomposition methods such as the Lanczos 
iterative technique, complexity grows rapidly with n. 

In this paper, we propose an agglomerative clustering algorithm that strives to minimize 
the normalized cut (or equivalently, maximize the normalized association). It is a fast, 
scalable algorithm, with almost linear complexity in the number of nodes for relatively 
sparse graphs. Performance evaluation using a range of benchmark and observed graphs 
indicates that the algorithm identifies partitionings that have average normalized association 
metrics as large as those of the partitions identified by the spectral clustering techniques. 
We also propose a method for identifying important partitioning scales which can be used 
to automatically select the number of clusters. To the best of our knowledge, this is the first 
model order selection method applicable to the normalized association maximization that 
is scalable. Through our experiments, we will show how effective our model order selection 
criterion is for both synthetic and real networks. 

The rest of the paper is organized as follows. In Section |2]the major approaches for mini- 
mization of normalized cut and model order selection are reviewed. The problem formulation 
of normalized association is re-stated in Section |3l Our proposed algorithm is detailed in 
Section HI The proposed algorithm is then compared to the state-of-the-art clustering algo- 
rithms in Section [51 The concluding remarks are provided in Section [61 
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2. Related Work 



The identification of clusters in graphs and networks has received significant attention. In 
our review we focus on a representative set of algorithms that minimize the normalized cut 
with or without spectral decomposition. We also review the existing methods for selecting 
the number of clusters. 



2.1. Optimization of Normalized Cut 

Spectral partitioning of graphs was first proposed by Donath and Hoffman in the 1970's [14 



Interest in the techniques was renewed in the 1990's when Pothen et al. described an algo- 
rithm for bi-partitioning using the Fiedler vector 15|. Hendrickson et al. and Karypis et al. 



contributed with multilevel algorithms for more efficient spectral partitioning (16|, [17 . 

The normalized cut metric was introduced by Shi and Malik in jof. They demonstrated 
how the bipartitioning task, with the objective of minimizing the normalized cut, could 
be relaxed to construct a generalized eigenvalue problem, and was thus related to spectral 
partitioning. The eigenvector corresponding to the second-smallest eigenvalue of the graph 
Laplacian identifies the bipartitioning (the real values in the eigenvector must be mapped to 
two discrete values for partitioning). This established the connection between minimization 
of normalized cut and spectral partitioning. Shi and Malik proposed a recursive bipartition- 
ing scheme in order to partition a graph into k clusters. In 18| , Meila and Shi proposed an 
algorithm that calculates k eigenvectors (thereby associating k real values with each node 
in the graph) and then uses a clustering algorithm, such as fc-means, to do the partitioning 



in 3? . Ng et al. observed in [19 



18[ is susceptible to failure when 



that the algorithm in 

there is substantial variation in the degree of connectivity between clusters. They proposed 
an alternative algorithm that uses a different normalization in both the eigenvalue problem 
and the construction of feature vectors prior to the application of /c-means. 

All of these algorithms involve a computationally expensive eigendecomposition. To 
address the computational difficulties for large graphs, Fowlkes et al. described a procedure 
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that uses the Nystrom method to reduce the complexity of the eigenvalue problem 
This does significantly reduce the computational overhead, but it is not enough to make the 
eigendecomposition algorithms scalable to large grap hs. Yan et al. recently described an 
algorithm for fast approximate spectral clustering [20| , but the focus is not on clustering for 
graphs (rather it addresses real- valued feature vectors). 



Dhillon et al. introduced a much faster algorithm for minimizing normalized cut in [11 



the graph is first greedily coarsened, then the coarsened graph is partitioned using a region 
growing procedure [l3] and finally weighted kernel /c-means clustering is applied to each 
partition to refine the clustering. 

Other methods also exist that strive to optimize related cost functions; for example 
Sharon et al. proposed a scalable hierarchical clustering using ratio association (the normal- 
ization term is the number of nodes in clusters) j2l|. Other examples include the work by 

1 that use 



Ding et al. [22[, Sharon et al. [23[, Akselrod-Ballin et al. [2J], and Corso et al 



the sum of internal weights of clusters as the normalization term. 

2.2. Model Order Selection 

The problem of selecting the number of clusters is almost as old as clustering itself. In 
the following discussion, k* denotes the number of clusters that a given approach identifies 
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as the true number of clusters. 

Numerous authors have used some notion of quahty of clustering, usually based on some 
definition of the inter- and intra-cluster distances, to identify the number of clusters. Let 
F{k) denote this notion for a given partitioning to k clusters. F{k) is then examined for 
various values of k. Partitionings for different values of k can be obtained from a hierarchical 
clustering, or alternatively, by several fiat partitionings. 

If F is not a monotonic function of k, the approach is usually to take k* = argmax^ F[k) 
or k* = argminfc-F(fc) (depending on the definition of F). The majority of the methods 



discussed in J26[| and [27| are of this type; Calinski and Harabasz [28|, C- index [29|, Baker 
and Hubert |30|, and cubic clustering criterion 3l| are among the most effective ones 26 . 
Tibshirani et al. 32| define F{k) as the gap, that is the difference of the average pairwise 
distance of the data points of the clustering at k^^ level and the expected value of the same 
measure of some reference model. This is similar to modularity |6| (discussed later) in the 
sense that it compares the clustering results to a reference model. None of these methods 
are directly applicable to graph clustering algorithms; the calculations of the defined metrics 
require pairwise distances which are not immediately available from a graph representation. 
Possible distance metrics include the shortest path 33 1 or the diffusion distance jsj; however 
the shortest path is very sensitive to noise and the calculation of the diffusion distance 
requires eigendecomposition. 

In the case where F is a monotonic function of k, the only extremal values of F{k) 
correspond to trivial values of k. Hence, the value of F{k) corresponding to two or more 
choices of k are examined to quantify the significance of a given level of a hierarchical 
cluste ring 34 -36|. In order to assess the clustering at level k of the hierarchy, Gnanadesikan 
et al. 34 1 propose the fraction F{k)/ F{k — 1), Arnold ^\ use the value of F{k) — F{k — 1), 



Krzanowski and Lai 
Pederson and Kulkarni 



351 employ the fraction \F{k) - F{k - 1)|/|F(A; + 1) - F{k)\, 
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and 
The 



suggest using the fraction 2 x F{k)/ {F{k — 1) + F{k + 1)) 
latter two approaches are the most similar to what we propose in Section H] in the sense that 
they use the preceding and succeeding levels of the hierarchy to obtain the significance of 
the clustering at level k. However the approaches in 35|, |36| are potentially susceptible to 
noise due to the division; small perturbations in the weights could lead to dramatic changes 
in the selected model order. 

Other approaches to selecting the number of clusters are based on the adoption of (semi- 
)parametric models for the structure of the graph. This allows the application of model 
selection techniques based on concepts such as the Bayesian InformationCriterion (BIG) 



and Akaike Information Griterion 39 



Despite the strong theoretical support for these 
methods, the adoption of a parametric model for the graph structure is undesirable. The 
models are often overly restrictive and do not ade qua tely capture the properties of many 
real-life networks. 



41[ that the input data are normally 



An example is the requirement in |40l . 
distributed (after projection of the graph into a real space). The more general methods based 
on mixture models do not scale well to very large graphs; even the recent approaches have 
only been applied to graphs with a few thousand nodes |42|-|44 . 



Some heuristics are based on the sizes of the clusters that are merged at different levels 



of the clustering hierarchy [45|, |46 



46j suggest that when two clusters 



The authors in 

with large number of nodes are merged, a significant amount of detail is lost; hence such an 
instance is potentially where a hierarchical clustering algorithm should stop. The authors 
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of 451 propose a similar approach. The main drawback of these approaches is that only the 
granularity of the clusters are taken into account and the number and the weights of the 
edges are simply ignored. 

A well-known and effective method of selecting the number of clusters is to examine the 
eigenvalues of the Laplacian of the graph that is to be clustered 19|, |47|, |48|] . For a graph with 



isolated connected components the multiplicity of the eigenvalue zero is equal to the number 
of clusters. Any other graph with well-separated clusters can be considered as a perturbation 
of this ideal case. Matrix perturbation theory states that the stability of the eigenvectors of 
a matrix is proportional to the eigengap (the difference between two successive eigenvalues). 
Von Luxburg [48| suggests using k* = argmax^ (A^ — A^+i). A large eigengap at k* is the 
case in which spectral algorithms using the Laplacian perform most successfully [loj. A 
more robust criterion is proposed by Zelnik-Manor and Perona 49[ that uses eigenvectors 
instead. Despite the solid theoretical support behind these approaches, the requirement of 
eigendecomposition makes them impractical when large graphs are being clustered. 

There have been attempts to identify the number of clusters using stability analysis. 
"Stability" has been defined differently by different authors, but they all based on the same 
intuition; with respect to some algorithm, a stable clustering is one that behaves relatively 
consistently in the presence of some controlled perturbation. The perturbation could be in 
terms of noise 50|, sampling subsets from the input 5l|, |52|, or random projection of a high 



dimensional data into a lower dimensional space [53|. These approaches require several runs 



of clustering for every value of k which makes them computationally expensive. Furthermore, 
Ben-David et al. 52[ warn against using stability analysis in this context, and they suggest 
that this family of model order selection techniques is not suitable for selecting the number 
of clusters in general. The intuition behind their work is that when an underlying objective 
function, F{k), has several local optima of relatively similar values, a clustering algorithm 
might be trapped by any of them, even when k is the true number of clusters. The difference 
in clustering solutions is interpreted as instability and the model order rejected, but the 
behavior is caused by the imperfect nature of the clustering algorithm. 

Two more recently proposed methods to select the number of clusters are based on 
optimization of the quality metrics of modularity ^ and description length Both methods 
simultaneously address both clustering and the selection of the number of clusters; they 
strive to optimize the objective function over all possible partitionings. These methods are 
discussed in more detail in Section El 



3. Problem Formulation 

Let G = {V,E,w) be a weighted graph having n = \V\ nodes and m = \E\ edgefl 
We assume that edge weights w{u,v) = w{v,u) > are non-negative and symmetric, that 
w{u, w) = if {u, v) ^ E, and that for (u, v) e E, the weight w{u, u) > is indicative of the 
similarity between nodes u and w; that is, the larger the weight, the more similar the nodes. 
We also allow for self-weights, w{u,u) > 0. 



'*In this work, we assume we are given the graph on which we wish to perform clustering. We do not 
address the problem of forming a graph from data, which arises when one applies graph clustering methods 
to general data sets; see, e.g., |48.. .54 .55i| . 
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For a fixed number of clusters, k, we measure tlie quality of the partition via the 
normalized cut metric joj, defined as follows. For a node u, denote the degree of u by 
yw{u,v), and for a subset U C V of nodes, let d{U) — J^ueu^i'^) denote the 
cumulative degree of the subset. Similarly, for two disjoint subsets of nodes, Vi and V2, let 
w{Vi, V2) = J2ueVi '^veV2 ^(^' ^0 denote the sum of the weights of edges with one end in Vi 
and the other end in V2. Let Ck denote the partition of nodes to k clusters, = {Ci, . . . , C^} 
where Ci is the subset of nodes affiliated to cluster i. The normalized cut metric is defined 
as 

,c..ic,.±-^^^. ,1, 

Minimizing the normalized cut can be interpreted as minimizing the similarity of nodes 
in different clusters, relative to the degree of each cluster. Alternatively, maximizing the 
intra-cluster similarity can be achieved by maximizing the normalized association, defined 
as 

NA.oc(CO ^ t (2) 

Moreover, maximizing normalized association is equivalent to minimizing normalized cut 
since w{Ci,Ci) + w{Ci,V\Ci) = d{Ci), and hence NAssoc(Cfc) = /c — NCut(Cfc). For the sake 
of readability, we adopt the maximization of normalized association as our goal in developing 
a clustering procedure. 



4. GANG: Greedy Agglomerative Normalized Gut 

In this section, we describe a greedy algorithm for building an agglomerative clustering 
on a graph. Although we do not make guarantees about its accuracy, the algorithm is fast 
on sparse graphs and yields excellent performance on a variety of examples, as illustrated 
in Section [51 Moreover, GANG has a model order selection criterion and does not have to 
be provided with the number of clusters a priori. The algorithm consists of three steps: 
agglomerative clustering, model order selection, and refinement. 

4.1. Agglomerative Clustering 

We propose greedy maximization of normalized association via an agglomerative hierar- 
chical clustering. In the following discussion, note that the number of clusters at stage k of 
the hierarchy is k. At stage k of the hierarchy, using the given partition, Ck, two clusters are 
merged to form a new partition, Ck-i- The two clusters to be merged are chosen greedily so 
that normalized association of stage /c — 1 is maximizedjf] 

Initially, C„ is a function that maps nodes to unique clusters, 1, . . . ,n. The degree of 
every node u, d{u), is calculated. Furthermore, for every edge, {u,v) G E, the improve- 
ment in normalized association by its contraction to obtain is stored in A{u,v) = 



^We note that a similar greedy merging algorithm is alluded to by Shi and Malik in Q, however they also 
suggest first projecting each node into 3?*^ using the first k eigenvectors of the graph Laplacian, and running 
an algorithm such as fc-means to obtain an initial clustering. Our approach requires no eigendecomposition 
and performs no such initialization. We also note that, although Shi and Malik mention having experimented 
with greedy merging, results for this approach have not been reported in the literature. 
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2w{u,v)/ {d{u) + d{v)), so that a matrix of the improvements is constructed. Here we as- 
sume that initially there are no self-loops, but if there are, the value of improvements are 
computed by (jl]) presented shortly. Throughout the agglomerative clustering procedure, 
(i(-), w{-,-), and A(-, ■) are updated at each stage as described below. 

At each iteration, the pair (u*, v*) = argmax^^ A(u, v) is selected for merging to create 
a larger cluster, uv*. The degree is computed for the newly constructed cluster, d{uv*) = 
d{u*) + d{v*). The weights and the improvement matrices are updated by removing the rows 
and columns corresponding to u* and v*, and inserting a new row and column corresponding 
to uv* (rows and columns are not removed or added in our implementation, but this is the 
practical effect). The weight matrix is updated as follows: 

w{uv*, x) =w{x, uv*) = w{u*, x) + w{v*, x), (3) 

and the improvement matrix update is: 

. , ^ , w(uv*,uv*) + w(x,x) + 2w(uv*,x) w(uv*,uv*) w(x,x) 

d[uv*) + d[x) d[uv*) d[x) 

for all the clusters, x, adjacent to either u* or v*. The self-weights are also calculated by: 

w{uv*, uv*) = w{u*, u*) + w{v*, V*) + 2w{u*, V*). (5) 

For all pairs of nodes {u,v) not adjacent to u* and f*, the weights w{u,v) and improve- 
ments A{u,v) remain unchanged. The above sequence of steps is repeated n — 1 times to 
form the clustering hierarchy. 

4-2. Model Order Selection 

Many of the clustering algorithms require the number of clusters to be provided to the 
algorithm a priori. However such information is often not available in practical situations 
making the decision about the number of clusters an issue in itself. 

It is worth noting that the stage number (k) that maximizes NAssoc(Ca:) does not neces- 
sarily correspond to a meaningful number of clusters]^ Here we propose a simple but effective 
approach to model order selection. Let Q = argmax^^NAssoc(Cfc) denote the partition that 
maximizes normalized assocation over all partitions of V into k clusters. To carry out model 
order selection, we examine the curvatureij of NAssoc(C^) which we define as 

Curv(fc) = (NAssoc(C^) - NAssoc(Cfc_i)) - (NAssoc(Cfc+i) - NAssoc(Cfc)) 

=2NAssoc(Cfc) - NAssoc(q_i) - NAssoc(Cfc+i). (6) 



^To see this, consider an unweighted graph that consists of two isolated chains, each having 4 nodes. The 
true number of clusters is trivially 2 resulting in NAssoc(C2) = 2; however if one groups the adjacent nodes 
to obtain 4 groups of node pairs, the resulting value of normalized association would be NAssoc(C4) = 2.66. 

^The reason we call this metric the curvature is its similarity to the central approximation of the 
second order derivative which is defined for a continuous function, /(■) as f"{x) ~ d1[f]{x)/h'^ = 
[f{x + h) — 2f{x) + f{x — h)] //i^. By substituting /i = 1, we get the negative of our curvature equation 
(the negation is to have positive peaks). 
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The first term of the above addition is the improvement of normahzed association moving 
from stage k to k — 1 of the hierarchy and the second term is the improvement moving from 
stage k + 1 to k. 

The function Curv(A;) captures the notion that a particular number of clusters, k, iden- 
tifies meaningful structural similarities embodied in the graph if it provides a normalized 
association which is significantly larger than the best partition with one additional cluster 
{k + 1) and little can be gained by reducing the number of clusters to A; — 1. 

In practice we do not have access to the optimal partitions, so we cannot evaluate the 
exact value of the curvature function. Instead, we approximate the curvature by using the 
normalized association values for the partitions returned by the agglomerative step of our 
algorithm. 

Note that the model order selection step of the algorithm could be used as a model 
order selection step for any other algorithm that maximizes the normalized association and 
generates a clustering hierarchy. Furthermore, this step of the algorithm can be considered 
optional if there is a prior knowledge about the true number of clusters. 

4-3. Refinement 

Greedy algorithms can get trapped by local optima. This is also the case with the 
agglomerative step of our algorithm, especially when the clusters are not clearly separated. 
After selecting the number of clusters either using the model order selection rule described 
previously or prior knowledge, a refinement step is invoked in order to improve the initial 
clustering results. The nodes are moved across the clusters to further improve the value of 



normalized association. A similar approach is taken in [56|, but groups of nodes are moved 
from a cluster to another instead of individual nodes. If we try all possible moves, we end 
up performing an exhaustive search of dividing n nodes into k clusters. This defeats the 
purpose of developing the fast agglomerative clustering step. Instead, we look only at the 
boundary nodes, i.e., the nodes that have at least one neighbor in another cluster. 

Initially the set of all the boundary nodes is identified. We use an n x k matrix to keep 
track of the neighborhood information of the boundary nodes. If B denotes this matrix, and 
/ denotes an n-dimensional vector: 



otherwise ^—^ 



(7) 



For each of the boundary nodes, improvements in normalized association by moving from 
their current cluster to each of their neighbors are calculated. This improvement for node u 
moving from cluster i to cluster j is 

. w{C,,Ci)-2I{u) w{C,,Cj) + 2B{u,j) 
^""'''•^^ d{C,)-d{u) d{Cj) + d{u) 

d{a) d{Cj) 

If no move leads to improvement, the node stays in the cluster to which it belongs. If 
one or more moves result in improvement, then the node is moved to the cluster that results 
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in maximum improvement. If u is moved from Cj to Cj, the corresponding cluster degrees 
and associations are updated: 



d{ar- = diet" - d{u) (9) 

^^^.^ncw _ d{CjY^^ + d{u) (10) 

w{C,, C,)"^- = w{C.,, Qf" - 21{u, zy'" (11) 
wiC,, C,r- = wiC,, C,t^ + 2B{u, (12) 

and for all the neighboring nodes of u denoted by entries of B and / are updated as 
follows: 



J(t;)°«^- = I{vy^'^ + W{U,V) 
J(t;)-«" = I{vf^'^ -W{U,V) 



B{v,jr- = B{v,jr"' + w{u,v) 

Biv.iY'-" = B{v,iy^'^ -w{u,v) 
B{v,jr- = Biv,jr'^ + w{u,v) 



live Cj, (13) 
if e a, (14) 

if f ^ Cj and v ^ Cj, (15) 



and finally 



5(u, z)'^^" = /(u)°i^ (16) 
I{ur- = B{u,jf^, (17) 
Biu,jr- = 0. (18) 

While performing the updates to ffT5]) . the set of boundary nodes is also updated. 
Whenever a node is moved from cluster i to cluster j, its neighbors in cluster i are added 
to the set of boundary nodes. For some node v neighboring u, moving u might result in 
B{v, z)"'^^ becoming zero in f|T5l) for all i; i.e., some nodes might be removed from the set of 
boundary nodes. 

One pass through all the boundary nodes is considered a single refinement iteration. 
When an iteration causes no positive improvement, the refinement procedure is stopped. 
Alternatively, one could specify the maximum number of refinement iterations. Note that 
although we prohibit the refinement step from emptying any cluster, we have not observed 
such an attempt in our experiments. 

4.4- Implementation and Computational Complexity of CANC 



We take a similar approach to [57[ when implementing the agglomerative clustering pro- 
cedure of GANG. Max-heaps and balanced binary trees are used to store the rows of the A 
matrix and the adjacency matrix. A separate heap is also used to store the maximum of 
each row of A. This leads to the complexity of 0{mh log(n)) for the agglomerative clustering 
procedure, where h is the heig ht of the generated dendrogram and m is the number of edges 



with non-zero weight (see [57| for details). 

The model order selection step of the algorithm requires 0{n) computations. The com- 
putational requirements are much less than those of the methods analyzing the eigenvalue 
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Algorithm 1 GANG: Greedy Agglomerative Normalized Gut 



GREEDY AGGLOMERATION 

1: Build the initial A matrix, and NAssoc(C„) ^ Sugy ^(^) ^)/'^(^) 

2: for /c = n — 1 to 1 do 

3: ^ argmaxjj A(z, j) 

4: NAssoc(Cfc) ^ NAssoc(Cfc+i) + A{i*,j*) 

5: Update rows and columns of A, the weights, and the degrees corresponding to clusters 

i* and J* 
6: end for 

MODEL ORDER SELEGTION 
1: for /c = n — 1 to 2 do 
2: Galculate Gurv(A;) (ED 
3: end for 

4: k* ^ argmaxfe Gurv(A;) or provided by the user 
REFINEMENT 

1: Obtain flat partitioning for level k* 

2: Build the initial B and / ([7]), and refine ^ true 

3: while refine do 

4: 5^0 

5: for n = 1 to 72 do 

6: if u G Ci is on the boundary and max^ 5{u, i,j) > then 

7: Move u from Ci to Cj*, where j* = argmaxj 6{u,i,j) 

8: Update B, I, the weights, and the degrees (P- [TS!) 

9: 6 <- 6 + 6{u,iJ*) 

10: end if 
11: end for 
12: if (5 = then 
13: refine ^ false 
14: end if 
15: end while 



fall-off which require eigendecomposition (e.g., 49| and 47l|). Performing an eigendecompo- 
sition to study the eigenvalue fall-off for all of the eigenvalues requires 0{n^) operations. 

To implement the reflnement step, we use the map data structure to store and update 
B. Every row of B is stored in a map {C++ STL implementation of the map data structure 
is used |58|). Updates to (IT^ are performed in constant time. To access each map entry 
of a row of B to insert, update, or delete, no more than 0{logk) operations are required, 
because the maximum number of clusters connected to a boundary node is strictly less than 
k. Updates f|T3|) . f|T^ . and (fT5|) each take no more than O(logfc) operations and are repeated 
for all neighbors of a node that is moved. We have observed through experiments that nodes 
with larger degrees do not tend to be on the boundaries and hence the average number of 
neighbors to be updated is practically smaller than the average node degree. ffTBl) . ffTS]) . and 
any insertion to or deletion from the set of boundary nodes are performed in 0(logk) time. 
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(fT7|) is trivial and is performed in a constant time. Not more than n nodes can be moved 
in a given iteration. When node i is moved, 0{di\ogk) operations are required, where di 
is the degree of node i. Summing over all boundary nodes, the complexity is 0{mlogk) 
because Yl^=i = 2m. Hence the number of operations in a single iteration is at most 
0{m\ogk). Throughout our experiments we have observed very small number of refinement 
iterations even when the refinement is applied to the networks with millions of nodes and 
edges. Furthermore the algorithm is capable of limiting the number of iterations. Hence the 
complexity of the average number of operations of the whole refinement step does not exceed 
0{m log k). 

The total computational complexity of GANG is dominated by the agglomerative clus- 
tering procedure. In practice, the height of the resulting dendrogram is often O(logn). Also, 
the graphs that are studied are usually sparse and hence m = 0{n). Therefore the com- 
plexity of the agglomerative clustering step of our algorithm is 0{n log^ n) for many real-life 
graphs. 

The worst case runtime happens when the graph is complete and the dendrogram is 
totally unbalanced. There are n — 1 agglomerations to construct the whole hierarchy each 
of which requires at most 0{n\ogn) operations. Hence the worst case complexity for the 
agglomerative clustering step of the algorithm is 0(n^log?T,) which is not expected to occur 
in practical situations. 

The memory requirement of GANG is 0(m) which is equivalent to 0{n) for sparse graphs. 
In summary, in many practical cases (relatively sparse graphs and balanced dendrograms), 
the computational complexity is 0(nlog n) and the memory requirement is 0{n). GANG 



can be downloaded from http://www.ece.mcgill.ca/~coates/software/ 



5. Experimental Results 

In this section we provide a comparison of the performance and the runtime of our 
proposed algorithm GANG with a selection of the state-of-the-art graph clustering algorithms 
from the literature whose implementations were downloaded from the corresponding authors' 
websites. Experiments were performed on an Intel 3.0 GHz Gore 2 Quad GPU with 8 GB 
of RAM and Ubuntu 9.10 operating system. 

5.1. Comparing Algorithms 

We compare to the following algorithms: Shi and Malik (recursive NGut) iiMeila and 



Shi (k-way NGut) [18|]; Ng, Jordan, and Weiss [19[; Dhillon, Gaun, and Kulis [HI]; Rosvall 
and Bergstrom [7|; and Blondel et al. 561]. The first four algorithms focus on maximizing 
NAssoc. Despite not addressing our criterion of interest directly, the other algorithms are 
included because they are scalable and represent the state-of-the-art in graph clustering. 
The discussed clustering algorithms ([sl, 18], 1^) are not scalable as they include eigen- 



decomposition. Dhillon, Guan, and Kulis proposed an algorithm that strives to maximize 



normalized association without requiring any eigendecomposition [11|. Similar to the other 
algorithms that address the maximization of normalized association, the Dhillon- Guan-Kulis 
algorithm requires the user to provide the number of clusters. After specifying the number 
of clusters, three steps are performed: a coarsening phase, a base clustering step using re- 
gion growing, and finally a refinement step using weighted kernel k-means with appropriate 
choices of the kernel and the node weights to maximize the normalized association. 
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Minimization of the criterion proposed by Rosvall and Bergstrom 0] results in a coarse- 
grained representation of the information flow through the network. Their proposed clus- 
tering objective aims to compress the underlying random walk on a graph without losing 
important network structures. To facilitate this, a coding is designed as follows: every clus- 
ter is assigned to a unique code; in a given cluster, every node is also assigned to a unique 
code. However two nodes in different clusters are allowed to be assigned to the same code. 
The algorithm then strives to minimize the average description length of a single step of a 
random walk. 

The algorithm by Blondel et al. targets maximizing modularity j56|. Modularity (Q) is 
defined as the total fraction of intra-cluster sum-weight minus the expected fraction if the 
edges (and weights) were distributed randomly while the node degrees were preserved: 



i=l 



where M = '^i'^jW{i^ j). Modularity provides a valuable metric of the connectedness of 
clusters, but a number of authors have demonstrated that it suffers from a resolution limit 



when used to select the number of clusters [59|] . The partitioning that maximizes modularity 



will generally not isolate clusters if the number of edges in the cluster is a small fraction of the 
total number of edges in the graph. The reason behind the resolution limit of modularity 
is the second term in the summation of (1191) : when the network gets larger, M increases 
monotonically. However the cluster degrees are not necessarily a function of the network size 
and are often bounded, regardless of the number of nodes 6^. This results in d{CiY / ^ 1 



and hence the value of modularity becomes dependent only on w{Ci)/M. By normalizing 



the summation by d{Ci) as suggested in 1^, the resolution limit phenomena is resolved; but 
we have 

i.e., maximization of ^normalized and normalized association are equivalent. 
5.2. Synthetic Graphs 

We first analyze the performance on synthetic graphs, for which the true clustering be- 
havior is known. We use benchmark graphs developed by Lancichinetti, Fortunato, and 
Radicchi 61| (LFR graphs) . These random graphs are designed based on the planted par- 
tition model j62|. Each node is assigned to one of k clusters. When edges are added to the 
graph, the probability that the edge is between nodes in the same cluster is 1 — yU and the 
probability that the edge joins nodes from different clusters is fi. The LFR benchmarks have 
heterogeneous cluster sizes with user-specified lower bound and upper bound, Cmin and Cmax, 
respectively. Furthermore the node degrees are upper-bounded by dmax and the average node 
degree is denoted by davg- As /i decreases, edges are increasingly likely to be intra-cluster, 
making the partitioning task easier. 

The Meila-Shi algorithm performs at least as well as the other spectral clustering algo- 



rithms (Ng- Jordan- Weiss [19| and Shi-Malik [9[) and hence we only display the Meila-Shi 
results. The local search of Dhillon-Guan-Kulis is also excluded often, because it results in 
negligible improvement while it introduces a very large computational overhead. 
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Figure 1: LFR benchmarks with Cmin = '20,Cmax — 50,dmax = 30, = 25, and variable proportions of 
inter-cluster edges (/i). In all of the above figures, Meila-Shi, Rosvall-Bergstrom, and GANC(refined) are 
overlapping. In (b) and (d) Jaccard index ~ 1 for the overlapping algorithms that corresponds to perfect 
extraction of the clusters. 

5.2.1. Maximizing normalized association 

Figure [1] examines how the algorithms perform with respect to maximizing NAssoc for 
1,000-node LFR graphs. In order to observe how the algorithms perform on graphs with 
heterogeneous clusters, we let the cluster sizes range from 20 to 50 nodesjfl For each value 
of /i, 100 graph realizations are generated. The value of NAssoc is divided by k to obtain a 
value between and 1 which represents the average NAssoc per cluster. 

The algorithms perform almost identically with the exception of the Dhillon et al. algo- 



rithm llj. The refinement step of GANG results in a significant improvement in the value 
of NAssoc. However, the local search of Dhillon's algorithm does not improve NAssoc sig- 
nificantly. Note that the Blondel et al. algorithm results in higher values of NAssoc due to 
selecting smaller number of clusters than the true ones (NAssoc(Cfc)//c is a monotonically 
decreasing function of /c); a fair comparison between algorithms is possible only when the 
number of clusters are equal. 

5.2.2. Comparing to the planted partitions 

The advantage of exploring performance on synthetic graphs is that a ground-truth parti- 
tioning is available. Because the LFR benchmarks are based on the planted partition model, 
there is the additional advantage that NAssoc is an appropriate criterion to adopt. Apart 
from some possible small errors due to the randomness inherent in the construction of the 
benchmark graphs, the ground truth partitioning will correspond to a maximum of NAssoc 



^In the real-life networks that we consider in this paper, the clusters have been observed to be of a limited 
size, regardless of the network size (Hoj : for example the Dunbar number suggests an upper limit of 150 nodes 
for clusters in a social network 16311. 
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for a given value of k. 

For comparing two partitions on the same graph, we use the Jaccard index 6J| which for 
two partitions, X and y, is defined as 



JI 



(a + b 



(21) 



where a, b, and c are, respectively, the total pair of nodes that are assigned to the same 
cluster in both X and 3^, the same cluster in X and different clusters in 3^, and the same 
cluster in y and different clusters in X. When two partitions are identical, the Jaccard index 
evaluates to one, and it decreases as the partitions deviate from each other. 

Figures [lb] show the closeness of the partitionings identified by the algorithms and the 
ground truth in terms of the Jaccard index. The algorithm by Blondel et al. deviates 
dramatically from the true clustering (Figure ITbl) . The reason is that the minimum and 
maximum cluster sizes are fixed for all the networks in addition to the average node degree. 
Hence the average number of edges per cluster remains the same making the proportion of 
the intra-cluster edges decrease. Due to the resolution limit of modularity, as the proportion 
of intra-cluster edges is decreased, modularity maximization algorithms tend to group several 
clusters into a single cluster. If we reduce n to 1000, the Blondel et al. algorithm leads to 
closer results to the ground truth. The algorithm by Dhillon et al. does not perform as well 
as other algorithms that strive to maximize NAssoc and the local search results in negligible 
improvement in terms of the Jaccard index. 
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Figure 2: The ring of 24 chques; the highest peak of curvature is at 24. 



5.3. Model order selection: the curvature metric 

We first provide an illustrative example of the use of the curvature metric for selecting 
the number of clusters in the partitioning. This example highlights the difference in behavior 



compared to the modularity metric used by modularity maximization algorithms [56 
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Figure 3: Curvature versus number of clusters for varying proportion of inter-cluster edges (/i). The true 
clustering indicated by red X. The peak of the curvature becomes less and less distinguishable by increasing 



/i. LFR benchmark graph with 1000 nodes and c„ 
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An example used by Good et al. [59|, to illustrate the resolution limit is the ring of 24 
clique^, each of which has 5 nodes and is connected to its neighboring clique by a single 
edge. The value of modularity is maximized by a 12-cluster partition that merges pairs 
of cliques together. A more natural clustering is to consider each clique as an individual 
cluster. Figure [2b] shows the curvature plot for the case of the ring of cliques. In addition 
to the peak of the curvature that is correctly located at 24 clusters, it is interesting to note 
that the other local peaks of the curvature are also meaningful; e.g., the peak at 12 clusters 
corresponds to the clustering that maximizes modularity. 

To see how the curvature indicates the true number of clusters for networks with more 
complex structures, we consider the LFR benchmarks. We fix the LFR parameters for a 
1,000-node graph and explore how the curvature changes as /i is varied. Figure E] indicates 
that the peak of the curvature plot correctly identifies the true number of clusters for fi 
in the range 0.1 - 0.5. As n increases, the main peak becomes less distinct; for n = 0.5, 
the second highest peak is almost as large as the primary peak. As the clusters become 
increasingly interconnected, the curvature plot provides a less clear indication of the true 



A clique is a subset of nodes in a graph which are fully connected. 
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number of clusters. 

Note that Figure [3] corresponds to a single realization of the LFR benchmarks. The 
empirical probabilities that the highest peak of the curvature corresponds to the true number 
of clusters for 100 realizations of the LFR benchmark with parameters as in Figure |3] are 
Pi = [1.0 1.0 0.96 0.60 0.42 0.06], for = [0.1 0.2 0.3 0.4 0.5 0.6] when no refinement 
step is applied; furthermore the empirical probabilities that one of the two highest peaks 
corresponds to the true number of clusters are p2 = [1.0 1.0 1.0 0.98 0.75 0.13]. 

These probabilities are obtained using the approximate curvatures derived from the ag- 
glomerative clustering procedure. To achieve a better insight into the model selection capa- 
bilities of the curvature metric, we derive better approximations by applying the refinement 
step to every level of the hierarchy and then recomputing curvature estimates. With this new 
procedure, we obtain pi = [1.0 0.99 0.99 0.95 0.87 0.77] andpa = [1.0 0.99 0.99 0.96 0.88 0.78]. 
This indicates that curvature can provide a good indication of model order even for fi = 0.6. 

The model order selection algorithm of Rosvall-Bergstrom algorithm quite successfully 
indicates the true number of clusters in the case of LFR benchmarks; the empirical proba- 
bilities for the Rosvall-Bergstrom algorithm is pi = [1.0 1.0 1.0 0.98 0.85 0.83]. The Blondel 
et al. algorithm does not successfully identify the true number of clusters when /i increases: 
Pi = [1.0 0.99 0.98 0.79 0.07 0.0]. Note that when n = 5, 000, the failure of the Blondel et al. 
algorithm becomes more evident and pi turns to a 0% success for any /i (this is reflected in 
Figured]). However the refined GANG and Rosvall-Bergstrom algorithms obtain the same 
success ratios when n = 5, 000. 

5.4- Real networks 

In this section, we examine the behavior the GANG and compare it to the other al- 
gorithms for graphs representing real networks. We first examine the behavior for small 
networks where there is knowledge of the ground truth partition. We then experiment with 
large networks, which allows us to assess the scalability of GANG. 

5.4-1- Small networks 

We conduct experiments using the Zachary karate club network 65| , the football network 
jsj, and the political books network^. The Zachary karate club network portrays friendships 
in a karate club. During the polling period, there was a dispute between the manager and 
the instructor which led to the establishment of a new club by the instructor. The karate 
students then split into two groups, either staying with the original club or following the 
instructor to the new club. The two clusters associated with the network correspond to these 
two groups. Each node in the football networks corresponds to a team in US college football. 
There are 11 conferences and 5 independent teams. Each edge corresponds to the existence 
of a match between the connected nodes. The independent teams can be interpreted as 
outliers. Nodes of the political book network correspond to the political books sold by 
Amazon.com and are categorized as neutral, conservative, and liberal. The edges between 
pairs of nodes correspond to frequent co-buying of the pairs of books by the same buyers. 
The three networks are unweighted. 



Collected by V. Krebs, http://www.orgnet.com 
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Tables [T] and |2] compare the performance of GANG with other clustering algorithms 
for these small, well-known datasets. The tables provide the average NAssoc per cluster 
(NAssoc//c) and the Jaccard index. The "true" number of clusters is indicated within paren- 
theses beside the name of each networlj^. In Table [H all algorithms (including GANG) are 
informed of the true number of clusters and instructed to generate a partitioning with that 
number of clusters. Table [1] indicates that the spectral clustering algorithms (Ng- Jordan- 
Weiss, Meila-Shi, and Shi-Malik) achieve similar performance in terms of NAssoc. We note 
that when k is specified for these three networks, GANG identifies partitionings with average 
NAssoc as large or larger than those of the partitionings identified by the spectral clustering 
techniques, with the exception of Shi-Malik in the case of the football network. 



Tabic 1: Small real networks: k known in advanee (average NAssoc per cluster/ Jaccard index). Maximum 
NAssoc is indicated by bold text. 





Karate(A: = 2) 


FootbaU(fc = 11) 


PolBooks(A: = 3) 


GANG (refined) 


0.872/0.89 


0.704/0.820 


O.88/0.67 


Dhillon-Guan-Kulis 


0.820/0.64 


0.700/0.820 


0.87/0.64 


Ng- Jordan- Weiss 


0.872/1.00 


0.701/0.825 


0.82/0.57 


Meila-Shi 


0.869/0.89 


0.696/0.820 


O.88/0.67 


Shi-Malik 


0.872/1.00 


0.706/0.825 


0.86/0.67 



Table [2] compares the performance of algorithms that select the number of clusters based 
on some aspect of the data. The number of clusters selected by each algorithm is shown 
within parentheses after the Jaccard index. Rosvall's algorithm uses a description length 
metric to select the number of clusters 0]; Blondel's algorithm chooses the partitioning that 
maximizes the modularity 56|. A direct comparison of NAssoc is not valid when k is not 
fixed. The first row of this table shows the performance of GANG when the curvature plot 



Table 2: Small real networks: Jaccard index for the estimated model orders . 





Karate 


FootbaU 


PolBooks 


GANG (curv) 


0.80 (fc 3) 


0.76 (fc 13) 


0.69 (fc = 2)) 


Blondel et al. 


0.53 (k = 4) 


0.70 (fc = 11) 


0.67 {k = 3) 


Rosvall-Bergstrom 


0.71 (fc = 3) 


0.83 (fc = 12) 


0.63 (fc = 5) 



is used to select the number of clusters. Although the selected number of clusters does not 
correspond to the true number of clusters for any of the networks, all three values can be 
explained. In the karate club network there is a group of students who have weak ties to 
other members of the network and these are identified as a third cluster; in the football 
network, GANG isolates the independent teams; in the political books network, GANG does 
not identify a neutral cluster and assigns each of the neutral books to either the liberal or 
conservative cluster. 



^^Note that the "true" number of clusters is something of an artificial social construct and does not 
necessarily correspond to a partitioning that maximizes any meaningful graph clustering metric. 
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5.4-2. Cortical Networks 



Here we study the networks presented in 66|] which are weighted graphs, each with 998 



nodes. The nodes correspond to small regions of the human cerebral cortex and the edges 
correspond to cortico-cortical axonal pathways. The networks are developed for five patients 
(the extraction is performed twice for patient A). We first perform a comparison of the 
competing algorithms, then we discuss the clustering results of GANG. 

When applying the Rosvall-Bergstrom and Blondel et al. algorithms the used does not 
have the freedom to choose the number of clusters. Hence we repeat the experiment twice 
to make a meaningful objective comparison in terms of NAssoc. Table [3] lists the average 
NAssoc of every clustering algorithm and patient. In the first subtable, k is set to the 
model order selected by the Rosvall-Bergstrom algorithm. Then, in the second subtable, k 
is set to the model order selected by the Blondel et al. algorithm, and in the third subtable, 
peaks of the model order selected by curvature are used. Note that the Shi-Malik algorithm 
outperforms the other two spectral algorithms for these datasets; hence we only include the 
Shi-Malik results. 



Table 3: NAssoc of the cortical networks. Maximum NAssoc is indicated by bold text. The model order 
selected for each column in each sub-table is identified within parentheses. 





A-f 


A-2 


B 


G 


D 


E 


Rosvall-Bergstrom 


0.725 (62) 


0.706 (68) 


0.738 (61) 


0.706 (62) 


0.691 (63) 


0.706 (58) 


GANG (refined) 


0.730 


0.716 


0.740 


0.712 


0.705 


0.718 


Dhillon-Guan-Kulis 


0.696 


0.682 


0.712 


0.688 


0.670 


0.684 


Shi-Malik 


0.715 


0.692 


0.727 


0.690 


0.686 


0.694 


Blondel et al. 


0.873 (14) 


0.872 (13) 


0.885 (15) 


0.862 (14) 


0.856 (14) 


0.828 (21) 


GANG (refined) 


0.886 


0.879 


0.883 


0.873 


0.864 


0.840 


Dhillon-Guan-Kulis 


0.860 


0.865 


0.872 


0.859 


0.851 


0.810 


Shi-Malik 


0.877 


0.874 


0.877 


0.860 


0.870 


0.826 


GANG (refined) 


0.908 (10) 


0.945 (4) 


0.899 (12) 


0.880 (13) 


0.930 (5) 


0.907(9) 


Dhillon-Guan-Kulis 


0.889 


0.946 


0.897 


0.863 


0.916 


0.885 


Shi-Malik 


0.898 


0.951 


0.886 


0.863 


0.936 


0.902 



The clustering results corresponding to the curvature peaks include cluster(s) that contain 
nodes from both of the brain hemispheres and clusters that include nodes from only one of 
the hemispheres. In the following discussion, we call the clusters that contain nodes from 
both of the hemispheres, the central clusters. Graphs Al, B, D, and E include only one 
central cluster. Graphs A2 and G each include two central clusters. 

The regions of the brain that are commonly grouped in the central clusters are posterior 
cingulate cortex, precuneus, cuneus, paracentral lobule, pericalcarine cortex, caudal anterior 
cingulate cortex, isthmuscingulate, isthmus of the cingulate cortex, and lingual gyrus, pro- 
vided that graph B is excluded. If the second largest peak of the curvature is selected for B 
[k = 5), the same regions are assigned to its central cluster. The first five of the mentioned 



regions are also classified as part of the structural core proposed by Hagmann et al. [66|; Hag- 



mann et al. used cluster strengths, cluster degrees, k-Gore, s-Gore, betweenness centrality. 
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Tabic 4: Larger real networks: Partition metrics ( (Number of clusters) NAssoc per cluster/execution time). 



Network 


Gond-Mat 


Googleweb 


Amazon 


AS-Skitter 


# of Nodes/Edges 


36458/171736 


342408/1142134 


403364/2443311 


1694616/11094209 


Avg/Max Degree 


28.5/278 


6.7/1367 


12.1/2752 


13.1/35455 


GANG 


(2121) 0.784/0m2s 


(12213) 0.869/2m30s 


(11237) 0.768/3m4s 


(26399) 0.769/46m33s 


Dhillon- Guan-Kulis 


(2121) 0.669/0m3s 


(12213) — 


(11237) 0.679/6m0s 


(26399) — 


Rosvall-Bergstrom 


(2121) 0.746/OmlOs 


(12213) 0.844/lml7s 


(11237) 0.712/llm26s 


(26399) 0.720/99m37s 


GANG 


(82) 0.957/0m2s 


(239) 0.993/2m30s 


(230) 0.984/3m4s 


(1776) 0.933/46m33s 


Dhillon- Guan-Kulis 


(82) 0.784/<ls 


(239) 0.935/0m2s 


(230) 0.872/0m4s 


(1776) 0.709/8m0s 


Blondel et al. 


(82) 0.849/Omls 


(239) 0.990/0m5s 


(230) 0.952/0ml4s 


(1776) 0.859/0m4s 


GANG 


(37) 0.970/0m2s 


(7) 0.999/2m30s 


(22) 0.997/3m4s 


(32) 0.994/46m33s 


GANG 


(526) 0.902/0m2s 


(855) 0.984/2m30s 


(206) 0.996/3m4s 





and efficiencjo to propose a structural core of the brain consisting of eight regions. 



The authors of 66 



also used modularity maximization and found six modules, two of 
which included nodes from both of the hemispheres (central clusters). The regions that are 
assigned to the central clusters by modularity maximization are similar to the ones extracted 
by GANG. 

5.4-3. Larger networks 

Here we illustrate the performance of our algorithm on large graphs. We apply our 
algorithm to four networks with different natures: Gond-Mat (a collaboration network) (63] , 
Googleweb (a web graph) (6olpl . Amazon (a product co-purchasing network) [6^, and AS- 
Skitter (an autonomous system graph) j69j. The nodes in Gond-Mat represent scientists 
that submit their preprints to the condensed matter archive at 'www. arxiv. org' the edges 
represent co-authorships. The nodes in Googleweb represent websites and the directed edges 
represent the existence of hyperlinks. Each node in Amazon network corresponds to a 
product purchased at iwww . amazon . coml Each directed edge from a node to another means 
that when the former is purchased, the latter is frequently also purchased. AS-Skitter is a 
network of autonomous systems extracted by traceroute analysiso 

The Gond-Mat graph is weighted and the rest of the graphs are unweighted. In Table HI 
the average and maximum degrees do not take the weights into account; the number of edges 
affects the run-time of GANG, not the weight values. None of the above graphs are originally 
connected. However each has a very large connected component that includes the majority 
of the nodes and the edges. Here we conduct clustering analysis of the largest connected 
component. Some of these graphs are directed, but we construct undirected graphs by adding 
the adjacency matrix to its transpose. 



^^Cluster strength and degree are the weighted and unweighted degrees, respectively. k-Core (s-Core) 
is the largest subgraph that includes nodes of degree (respectively, strength) at least k (respectively, s). 
Betweenness centrality of a region is a measure of the proportion of the shortest paths passing through it. 
Efficiency of a region indicates how short the region's average path lengths to other regions are. 



^■^Googlc programming contest: http://www.google.coin/prograinining-contest/ 
^^ (http : //www . caida . org/tools/measurement/skitter 



Table S] compares the performance of the algorithms that are scalable to such large 
networks. The table shows the average NAssoc per cluster of the identified partitioning 
and the time required for completion of the algorithm. The spectral clustering algorithms 
cannot be executed on our test machine when either the number of nodes or the number 
of clusters is very large. Both the computational time and the memory requirements are 
excessive. We therefore compare the performance of GANG, Dhillon-Guan-Kulis (no local 
search), Rosvall-Bergstrom, and the Blondel et al. algorithms. 

The Rosvall-Bergstrom and Blondel et al. algorithms automatically select the number of 
clusters. A meaningful comparison of the average NAssoc is only possible when the number 
of clusters are equal; to facilitate comparisons, we therefore construct multiple partitionings 
for GANG and the Dhillon-Guan-Kulis algorithms, each with a different number of clusters. 
We also construct a partitioning corresponding to one or more of the peaks in the curvature 
plot (for some of the networks, there are two peaks that are very similar in value, so we 
consider it useful to examine the partitionings corresponding to each). The last two rows of 
Table H] correspond to those peaks. 

GANG is superior to other algorithms with respect to maximizing NAssoc in all of the 
cases. GANG also significantly outperforms the Dhillon-Guan-Kulis algorithm, which is also 
striving to maximize NAssoc. The latter is also outperformed by Rosvall-Bergstrom and 
Blondel et al. algorithms. 

In terms of the computation time, GANG is often faster than Rosvall-Bergstrom, but the 
scaling behavior is different. To illustrate this, in addition to the graphs listed in Tabled! we 



have applied the algorithms to the road network graph of Galifornia [60[ which is extremely 
sparse (the average degree is 2.8). The graph contains around 2 million nodes. While GANG 
performs the clustering in 32 seconds, in takes 197 minutes for the Rosvall-Bergstrom algo- 
rithm to converge to a solution. The complexity of GANG is dominated by the agglomerative 
clustering procedure which requires 0{mh log n) operations. Hence the speed of GANG de- 
pends on the number of edges in the graph, and the height of the dendrogram. However the 
graph sparsity does not show an impact on Rosvall-Bergstrom's execution time. Blondel's 
algorithm is much faster than GANG, but as mentioned previously, it suffers from resolution 
limit associated with clustering algorithms that maximize modularity. The Dhillon-Guan- 
Kulis algorithm is also very fast for small values of k, but it gets slower and its memory 
requirements become excessive if k becomes large. 

To illustrate the properties of the algorithm outputs in terms of the cluster sizes and 
the connectivity of individual clusters, we have examined the plots of NAssoc of individual 
clusters versus the number of nodes in them. Each algorithm behaves similarly on differ- 
ent graphs presented in Table HI Figure Ha] compares GANG and Rosvall-Bergstrom when 
applying these algorithms on Amazon co-purchasing graph. For the same number of clus- 
ters, GANG produces clusters with higher values of NAssoc than Rosvall-Bergstrom. The 
latter produces many small clusters with very low values of NAssoc. The behavior of the 
Dhillon-Guan-Kulis and the Blondel et al. algorithms are discussed in our case study. 

5.5. Case Study: US Patent Citation Graph 

As a case study, we consider the undirected version of the citation graph released by 



the National Bureau of Economic Research |70l. \7l\\. The patents are classified into 6 broad 



technological categories. A more refined classification leads to 36 sub-categories. We use 
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(a) Amazon (b) US Patent Citation 

Figure 4: NAssoc of individual clusters. 



each patent's label (category or sub-category) in addition to NAssoc and runtime to perform 
a comparison. 

The original graph is not connected; however the largest connected component contains 
more than 99.7% of the nodes and all of the edges. Hence we focus on the largest connected 
component which contains 3,764,117 nodes and 16,511,740 edges. The maximum node degree 
is 793. 



5.5.1. Clustering Runtime 

The Rosvall-Bergstrom algorithm j3| was terminated after more than 30 hours without 
converging to a solution. GANG takes 77 minutes to construct the full hierarchy. After the 
hierarchy has been generated, each flat partitioning including the refinement takes less than 



35 seconds. The algorithm by Blondel et al. |56| takes 8 minutes. The Dhillon-Guan-Kulis 



algorithm [ll| takes 72 seconds for k = 57 and increases as k is increased. 



5.5.2. Maximization of Normalized Association 

In order to have a fair comparison in terms of NAssoc, we fix = 57 to match the number 
of clusters of the Blondel et al. result. The values of NAssoc//c for GANG, Dhillon-Guan- 
Kulis, and Blondel et al. are 0.964, 0.859, and 0.855, respectively. The values of NAssoc 
of the individual clusters are shown in Figure |4b] which illustrates the clear superiority of 
GANG. 



5.5.3. Extraction of True Clusters and Absence of Large Well-Defined Clusters 

We use the categories and sub-categories to classify the nodes in the patent citation 
graph. We denote the maximum proportion of nodes from the same class in a given cluster 
as the homogeneity proportion of that cluster. For the Blondel et al. and Dhillon-Guan- 
Kulis algorithms, we use k = 57 and for GANG we use k = 52 (the closest peak of the 
curvature plot). The clusters are sorted according to their homogeneity proportion in Figure 
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[Sal The figure shows the superiority of GANG in extracting clusters of nodes from the same 
categories. When sub-categories are employed for the assessment, the superiority of GANG 
becomes more pronounced. 
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Figure 5: The 52 most homogeneous clusters out of 57 clusters using the Blondel et al. and the Dhillon- 
Guan-Kulis algorithms. 



Figure [5a] alone does not provide a fair comparison as any singleton would have homo- 
geneity proportion of 1. In Figure [5b| the homogeneity proportion is plotted versus the size 
of the extracted clusters for the three algorithms. 

The Blondel et al. algorithm produces clusters as small as 14 to as large as 250,000 
nodes. However the largest homogeneous cluster that is extracted has 38 nodes. The quality 
of clusters degrade as the size of the clusters increase. This trend is very likely to be due to 
the resolution limit of modularity. 



Both GANG and Dhillon-Guan-Kulis produce a very large cluster (the core cluster 60|). 
Excluding the core, Dhillon-Guan-Kulis produces clusters of an average size oi n/k which is 
around 37,000 nodes in this example (See Figure ISbl) . This behavior of the Dhillon-Guan- 
Kulis algorithm is due to the underlying region-growing procedure it adopts from Metis 
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The clusters extracted by GANG are much smaller than the other two algorithms. Ex- 
cluding the core, the cluster sizes are between 7 and 225. Similar to previous observations 
of Leskovec et al. 60| , the strongly-connected clusters of the large graphs listed in Table H] 



diminish in number as we move towards the center of the graph. Having a small homogene- 
ity proportion is then expected for the core as it includes several clusters that are not very 
well-connected and are lumped together. If we study a clustering that is in a lower level in 
the hierarchy, further clusters would be extracted from the core and the core shrinks. The 
resulting clusters are not as well-connected in terms of NAssoc though. In other words, the 
most isolated and well-connected clusters are extracted at the higher levels of the hierarchy. 
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6. Conclusion 



We proposed a novel algorithm to maximize normalized association that consists of three 
steps: the agglomerative hierarchical clustering procedure, the model order selection step, 
and the refinement step. 

The agglomerative clustering procedure requires O(nlog^n) operations for many real-life 
graphs, where n is the number of nodes in the graph. This procedure dominates the compu- 
tational complexity of the other steps of the algorithm. The second step of our algorithm, the 
model order selection, is based on the relative improvement of normalized association when 
changing the number of clusters; the curvature plot is used to select one or several model 
orders for the final clustering. Unlike modularity, the curvature metric does not exhibit an 
intrinsic resolution limit. For a multi-resolution analysis, a user can specify a range on the 
allowable number of clusters, and the algorithm will select the number of clusters with the 
maximum curvature in that range. After selecting the number of clusters, the clusters are 
passed to the refinement step. This step of our algorithm iterates over the boundary nodes in 
the clusters and explores possible improvements in normalized association by moving each 
of the boundary nodes to their neighboring clusters. Using the map data structure, the 
overhead added by the refinement becomes negligible. Experiments show that despite the 
negligible runtime of the refinement step, it significantly improves the initial results with 
respect to the normalized association maximization. 

Our experimental analysis on relatively small networks indicated that the proposed algo- 
rithm identified partitions that have normalized association values comparable to the spectral 
algorithms that involve an eigendecomposition. These algorithms are too computationally 
complex to be applied to very large graphs. We demonstrated that our proposed algorithm 
can be applied to large graphs (millions of nodes and edges). For these large graphs, the pro- 
posed algorithm identifies partitions that have larger values of normalized association than 
those identified by the only comparable algorithm that directly addresses the normalized cut 
metric. 

A clustering algorithm can by no means be suitable for every application. For example, 
despite the failure of Dhillon-Guan-Kulis algorithm to extract clusters with nodes of the same 
categories in the patent citation network, it generates clusters of very similar sizes. This is 



more suitable for VLSI applications for instance [72|. On the other hand, when clustering is 
meant to extract "communities" (group of nodes with strong intra-connection), GANG and 
Rosvall-Bergstrom are clearly preferred. 

The Blondel et al. and Rosvall-Bergstrom algorithms are not able to generate a parti- 
tioning to an arbitrary number of cluster^: even though they are agglomerative, they do 
not generate the complete hierarchy because they merge several nodes/clusters in each of 
their iterations. Hence if one is interested in several arbitrary clustering levels, GANG fits 
one's requirement the best out of the competing algorithms discussed. 

In this paper we only considered undirected graphs while the edge directions could carry 
valuable information about the structure of a graph. An extension of GANG can be developed 



by adopting the generalized normalized cut criterion [73 



^^Thc available implementation of the Rosvall-Bergstrom algorithm is based on the Blondel et al. algorithm. 
See http : / /www . tp . umu . se/~rosvall/ algorithm . pdf , 
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